Welcome to the Wonderland of Data Science!

Our analysis uses data from the 20,036 responses to Kaggle’s 2020 survey. Although the survey is limited to the Kaggle community, we believe the insights about these Kagglers apply not only to our community but also to the larger data science community, helping us better understand the roadmaps people are following, the tools they are developing, and the language they use to communicate about data science today. We tell our story with radar charts because we want to focus on the behavioral tendencies of the respondents and compare trends across the different groups of data scientists detected within the Kaggle community.

Just like Alice on her adventure in Wonderland, no matter where you are on the data science path, you will “grow up” and survive in this confusing yet very exciting world. We hope our analysis will serve as a guide to help you master the art of data science and stay up to date with the latest trends in the field. Now, let’s get started!

Clustering analysis

In this analysis, we aim to detect different groups of data scientists within the Kaggle community using a clustering technique. Questions 1, 2, 3, 4, 6, and 15 are selected as input features because demographics, education, and years of experience are relatively fixed properties that characterize an individual. Given the clusters, people who did not take the survey can identify the group most relevant to them from their background information and gain useful insights about career tracks, technical skills, and learning resources through our analysis.

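The input questions are listed below:

  • Q1: What is your age (# years)?

  • Q2: What is your gender?

  • Q3: In which country do you currently reside?

  • Q4: What is the highest level of formal education that you have attained or plan to attain within the next 2 years?

  • Q6: For how many years have you been writing code and/or programming?

  • Q15: For how many years have you used machine learning methods?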

Method

With 6 input questions corresponding to 6 categorical dimensions, summarizing and visualizing the data is difficult, and clustering can be ineffective in a high-dimensional space. To overcome these limitations, we first apply correspondence analysis (CA), a counterpart of principal component analysis suited to exploring relationships among qualitative variables. This technique is commonly used on survey data to identify associations between variable categories. Since our data contains more than two categorical variables, we use multiple correspondence analysis (MCA), which handles any number of categorical variables. One drawback of MCA is that its leading dimensions typically explain only a small share of the total variance. For the purpose of this analysis, we keep the first two dimensions, which together account for 10.5% of the total variance in the survey answers.
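The MCA step described above can be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not the code behind the 10.5% figure: the helper names `one_hot` and `mca` are hypothetical, the six toy rows are made up, and the explained-variance ratio uses the raw inertia of the indicator matrix without any correction.

```python
import numpy as np

def one_hot(rows):
    """Build the indicator matrix: one column per (question, answer) pair."""
    columns = sorted({(j, v) for row in rows for j, v in enumerate(row)})
    index = {key: i for i, key in enumerate(columns)}
    Z = np.zeros((len(rows), len(columns)))
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            Z[i, index[(j, v)]] = 1.0
    return Z, columns

def mca(rows, n_components=2):
    """MCA as correspondence analysis of the indicator matrix (hypothetical helper)."""
    Z, _ = one_hot(rows)
    P = Z / Z.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    # Standardized residuals from the independence model r * c^T
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, _ = np.linalg.svd(S, full_matrices=False)
    # Principal coordinates of the respondents on the leading dimensions
    coords = (U[:, :n_components] * s[:n_components]) / np.sqrt(r)[:, None]
    explained = s ** 2 / np.sum(s ** 2)  # share of total inertia per dimension
    return coords, explained[:n_components]

# Made-up answers (age, degree, years of coding), for illustration only
rows = [
    ("22-24", "Bachelor", "1-2 years"),
    ("22-24", "Bachelor", "< 1 year"),
    ("25-29", "Bachelor", "Never"),
    ("25-29", "Master", "1-2 years"),
    ("30-34", "Master", "5-10 years"),
    ("30-34", "Master", "5-10 years"),
]
coords, explained = mca(rows, n_components=2)
print(coords.shape, explained)
```

Each respondent ends up as a point in two dimensions, which is the form of input the clustering step needs. A low explained share for the first two dimensions is common with MCA on an indicator matrix.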

Our goal is to better understand the specific communities represented in the survey. We therefore use K-means clustering, one of the simplest and most popular clustering algorithms, which groups similar data points together to uncover underlying patterns. To detect clusters, K-means chooses k centroids and assigns every data point to the nearest one, while keeping the within-cluster distances to each centroid as small as possible. After examining different possible values of k, we pick k = 3. This choice helps the model avoid overfitting while keeping it descriptive. Contextually, we believe 3 is a suitable number of groups of similar respondents, allowing us to dive deep into the survey dataset and tell the diverse stories of data scientists from around the world.
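The assign-and-update loop described above can be sketched as follows. This is a minimal NumPy version of Lloyd's algorithm, not the implementation used for the reported results: the `kmeans` function, the synthetic two-dimensional points, and the range of k values are all assumptions made for illustration.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm (hypothetical helper, random initial centroids)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and within-cluster sum of squares (inertia)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    inertia = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, inertia

# Synthetic 2-D points standing in for the first two MCA dimensions
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=(-2.0, 0.0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(0.0, 2.0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(2.0, 0.0), scale=0.3, size=(50, 2)),
])

# Compare the within-cluster sum of squares for several candidate values of k
inertias = {k: kmeans(X, k)[2] for k in range(1, 6)}
labels, centroids, _ = kmeans(X, 3)
print(inertias)
```

One common way to examine candidate values of k is to look for the point where the inertia stops dropping sharply. The text does not say which criterion was used to arrive at k = 3, so this is only one possibility.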

Results

Group | Count
Experts | 5966
Junior Players | 12039
Amateur Outsiders | 2031
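The group counts above can be converted into shares of the 20,036 responses with a few lines of arithmetic. The variable names are illustrative; the counts are the ones reported in the table.

```python
# Cluster sizes as reported in the table above
counts = {"Experts": 5966, "Junior Players": 12039, "Amateur Outsiders": 2031}

total = sum(counts.values())
# Share of each group, in percent, rounded to one decimal place
shares = {group: round(100 * n / total, 1) for group, n in counts.items()}

for group, pct in shares.items():
    print(f"{group}: {pct}%")
```

The three counts sum to 20,036, and the shares come out at about 29.8%, 60.1%, and 10.1%.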


Group | Age | Continent | Degree | Years of Coding | Years of Machine Learning
Experts | 30-34 | Europe | Master’s degree | 5-10 years | 2-3 years
Junior Players | 22-24 | Asia | Bachelor’s degree | 1-2 years | Under 1 year
Amateur Outsiders | 25-29 | Asia | Bachelor’s degree | I have never written code | NA
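A profile table like the one above could be built by taking the most frequent answer per group. The text does not say how the table was produced, so the sketch below is an assumption: the `modal_profile` helper, the record keys, and the handful of toy records are all hypothetical.

```python
from collections import Counter, defaultdict

def modal_profile(records, group_key, feature_keys):
    """Return the most frequent value of each feature within each group."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec[group_key]].append(rec)
    profile = {}
    for group, members in grouped.items():
        profile[group] = {
            key: Counter(m[key] for m in members).most_common(1)[0][0]
            for key in feature_keys
        }
    return profile

# Made-up respondents, for illustration only
records = [
    {"group": "Experts", "age": "30-34", "degree": "Master’s degree"},
    {"group": "Experts", "age": "30-34", "degree": "Master’s degree"},
    {"group": "Experts", "age": "35-39", "degree": "Master’s degree"},
    {"group": "Junior Players", "age": "22-24", "degree": "Bachelor’s degree"},
    {"group": "Junior Players", "age": "22-24", "degree": "Bachelor’s degree"},
    {"group": "Junior Players", "age": "18-21", "degree": "Bachelor’s degree"},
]
profile = modal_profile(records, "group", ["age", "degree"])
print(profile)
```

A modal profile describes the typical member of a group, not every member. Each cluster still contains respondents with other ages, degrees, and experience levels.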

Overall, there are three representative groups in the Kaggle community:

  • Amateur Outsiders
Alice and the Rabbit Hole


The first group consists of individuals aged 25-29, mostly with a Bachelor’s degree but with neither coding nor machine learning experience. They appear to be amateur outsiders who have some interest in data science and would like to learn more or enter the field. This is the smallest of the three groups, accounting for roughly 10% of the sample.

  • Junior Players
Alice at the Mad Tea Party


The second group consists of individuals aged 22-24, mostly with a Bachelor’s degree, 1-2 years of coding experience, and less than a year of machine learning experience. They appear to be fresh graduates or beginners in the data science world who already know what it holds and are on their way to discovering more of the path. This is the largest of the three groups, accounting for roughly 60% of the sample.

  • Experts
Alice Waking Up at the Riverbank


The third group consists of individuals aged 30-34, mostly with a Master’s degree and several years of both coding and machine learning experience. We can call them the experts of the game, as they appear to be considerably more proficient in data science than the other two groups. This is the second largest of the three groups, accounting for roughly 30% of the sample.

Output

Insights about learning process

  • Q37: On which platforms have you begun or completed data science courses?

  • Q38: What is the primary tool that you use at work or school to analyze data?

  • Q39: Who/what are your favorite media sources that report on data science topics?

Insights about technical skills

  • Q7: What programming languages do you use on a regular basis?

  • Q8: What programming language would you recommend an aspiring data scientist to learn first?

  • Q9: Which of the following integrated development environments (IDE’s) do you use on a regular basis?

  • Q14: What data visualization libraries or tools do you use on a regular basis?

  • Q16: Which of the following machine learning frameworks do you use on a regular basis?

  • Q29-A: Which of the following big data products (relational databases, data warehouses, data lakes, or similar) do you use on a regular basis?

  • Q31-A: Which of the following business intelligence tools do you use on a regular basis? (Select all that apply)

  • Q36: Where do you publicly share or deploy your data analysis or machine learning applications?

Insights about jobs, salaries, companies

  • Q5: Select the title most similar to your current role (or most recent title if retired)

  • Q20: What is the size of the company where you are employed?

  • Q21: Approximately how many individuals are responsible for data science workloads at your place of business?

  • Q22: Does your current employer incorporate machine learning methods into their business?

  • Q23: Select any activities that make up an important part of your role at work

  • Q24: What is your current yearly compensation (approximate $USD)?

Conclusion